[ENH] Quantized Spann Segment Writer #6397
base: main
Conversation
This stack of pull requests is managed by Graphite. Learn more about stacking.

Reviewer Checklist
Please leverage this checklist to ensure your code review is thorough before approving:
- Testing, Bugs, Errors, Logs, Documentation
- System Compatibility
- Quality
Feature-Gating Quantized SPANN Segment Writer Pipeline

This PR introduces a complete quantized SPANN segment writer/flush pipeline, wiring it through the segment manager, schema helpers, and feature gating so quantized segments can be produced and reopened safely. The effort separates

Key Changes
• Added

Possible Issues
• Quantized centroid/metadata files are not included in

This summary was automatically generated by @propel-code-bot
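The feature gating described above can be illustrated with a minimal, self-contained sketch. The names (`VectorWriter`, `select_writer`, the `quantized_spann` feature) are illustrative stand-ins, not the PR's actual identifiers:

```rust
// Hypothetical sketch of compile-time feature gating for a writer
// implementation, assuming a Cargo feature named "quantized_spann".
#[derive(Debug)]
enum VectorWriter {
    Spann,
    #[cfg(feature = "quantized_spann")]
    QuantizedSpann,
}

fn select_writer(quantized_requested: bool) -> VectorWriter {
    // When the feature is compiled in, honor a request for the quantized writer.
    #[cfg(feature = "quantized_spann")]
    if quantized_requested {
        return VectorWriter::QuantizedSpann;
    }
    // Otherwise (or when the feature is off) fall back to the plain SPANN writer.
    let _ = quantized_requested;
    VectorWriter::Spann
}
```

With the feature disabled, the quantized variant does not exist at all, so a stale or misconfigured segment cannot accidentally select it.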
```rust
            QuantizedSpannSegmentError::Config(format!(
                "failed to parse record segment file path: {e}"
            ))
        })?;
        let options = BlockfileReaderOptions::new(id, prefix.to_string());
        let reader = blockfile_provider.read(options).await.map_err(|e| {
            QuantizedSpannSegmentError::Config(format!(
                "failed to open record segment reader: {e}"
            ))
        })?;
        Some(reader)
    }
    None => None,
    },
    None => None,
};

// Order matches file_path_keys: cluster[0], embedding_metadata[1],
// quantized_centroid[2], raw_centroid[3], scalar_metadata[4].
let file_ids = QuantizedSpannIds {
    embedding_metadata_id: parsed[1].1,
    prefix_path: prefix_path.clone(),
    quantized_centroid_id: IndexUuid(parsed[2].1),
    quantized_cluster_id: parsed[0].1,
    raw_centroid_id: IndexUuid(parsed[3].1),
    scalar_metadata_id: parsed[4].1,
};
QuantizedSpannIndexWriter::open(
    cluster_block_size,
    vector_segment.collection,
    spann_config,
    dimensionality,
    distance_function,
    file_ids,
    cmek,
    prefix_path.clone(),
    raw_embedding_reader,
    blockfile_provider,
    usearch_provider,
)
.await?
```
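The "failed to parse record segment file path" error above guards a path-to-id parsing step. A hedged, stdlib-only sketch of that kind of parsing (the helper `parse_file_path` and the `"<prefix>/<id>"` layout are assumptions for illustration, not the PR's real code):

```rust
// Hypothetical helper: split a stored file path of the form "<prefix>/<id>"
// into its prefix and id components, erroring on malformed input the same
// way the snippet above maps parse failures into a Config error.
fn parse_file_path(path: &str) -> Result<(String, String), String> {
    // The id is everything after the last '/'; the prefix is the rest.
    let (prefix, id) = path
        .rsplit_once('/')
        .ok_or_else(|| format!("failed to parse record segment file path: {path}"))?;
    if id.is_empty() {
        return Err(format!("failed to parse record segment file path: {path}"));
    }
    Ok((prefix.to_string(), id.to_string()))
}
```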
[Logic] `apply_materialized_log_chunk()` now hard-fails whenever the materialized log record doesn’t contain an embedding inline. In production, `materialize_logs()` commonly hydrates embeddings from the record segment (the log itself often omits them after compaction), so legitimate `AddNew`/`OverwriteExisting` operations will now panic with `QuantizedSpannSegmentError::Data`. You should accept the `RecordSegmentReader` that’s already being passed to `VectorSegmentWriter::apply_materialized_log_chunk`, hydrate the record when `embeddings_ref_from_log()` returns `None`, and only error when both sources are missing. For example:

```rust
pub async fn apply_materialized_log_chunk(
    &self,
    record_segment_reader: &RecordSegmentReader<'_>,
    materialized_chunk: &MaterializeLogsResult,
) -> Result<(), ApplyMaterializedLogError> {
    for record in materialized_chunk {
        let embedding = match record.embeddings_ref_from_log() {
            Some(v) => Cow::Borrowed(v),
            None => Cow::Owned(
                record
                    .hydrate(record_segment_reader, 1)
                    .await?
                    .embedding
                    .to_vec(),
            ),
        };
        self.index.add(record.get_offset_id(), &embedding).await?;
    }
    Ok(())
}
```

Without this fallback, any replay that relies on the record segment (which is the default case) will immediately fail.
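The borrow-or-hydrate pattern in the suggestion can be isolated into a small, self-contained sketch. `LogRecord`, `RecordSegment`, and `resolve_embedding` are simplified stand-ins for Chroma's real types, assuming only that the log record may or may not carry the embedding inline:

```rust
use std::borrow::Cow;

// Stand-in for a materialized log record that may carry its embedding inline.
struct LogRecord {
    inline_embedding: Option<Vec<f32>>,
}

// Stand-in for the record segment that stores embeddings durably.
struct RecordSegment {
    stored_embedding: Vec<f32>,
}

// Prefer the inline embedding (zero-copy borrow); fall back to the record
// segment only when the log omits it, mirroring the reviewer's suggestion.
fn resolve_embedding<'a>(
    record: &'a LogRecord,
    segment: &RecordSegment,
) -> Cow<'a, [f32]> {
    match record.inline_embedding.as_deref() {
        Some(v) => Cow::Borrowed(v),
        None => Cow::Owned(segment.stored_embedding.clone()),
    }
}
```

`Cow` avoids copying in the common inline case while still owning the hydrated vector when the fallback path allocates one.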
File: rust/segment/src/quantized_spann.rs
Line: 187
`AddNew`/`OverwriteExisting` requires the embedding to be present as a system invariant
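The invariant the author describes can be sketched as an explicit check, where a missing embedding on an add/overwrite is surfaced as a data error rather than silently skipped. `Op` and `check_embedding_invariant` are hypothetical names, not the PR's actual types:

```rust
// Hypothetical materialized-log operation kinds.
#[derive(Debug)]
enum Op {
    AddNew,
    OverwriteExisting,
    Delete,
}

// Enforce the stated invariant: AddNew and OverwriteExisting must carry an
// embedding; other operations (e.g. Delete) legitimately have none.
fn check_embedding_invariant(op: &Op, embedding: Option<&[f32]>) -> Result<(), String> {
    match op {
        Op::AddNew | Op::OverwriteExisting if embedding.is_none() => {
            Err("materialized record missing embedding".to_string())
        }
        _ => Ok(()),
    }
}
```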
Description of changes
Summarize the changes made by this PR.
- `QuantizedSpannIndexWriter` to facilitate segment writer `commit` to a separate `finish`
- `QuantizedSpannSegmentWriter`, under the feature flag
- `VectorSegmentWriter` to use the new writer impl

Test plan
How are these changes tested?
- `pytest` for python, `yarn test` for js, `cargo test` for rust

Migration plan
Are there any migrations, or any forwards/backwards compatibility changes needed in order to make sure this change deploys reliably?
Observability plan
What is the plan to instrument and monitor this change?
Documentation Changes
Are all docstrings for user-facing APIs updated if required? Do we need to make documentation changes in the docs section?